feat(deletions): Retry deleting from Nodestore for certain Snuba errors #102391

Draft

armenzg wants to merge 2 commits into master from seer/retry-snuba-errors-nodestore-deletion

+136 −23

Member

armenzg commented Oct 30, 2025

Snuba can sometimes raise errors when we're trying to remove from the Nodestore.

This change retries the task for a specific subset of those errors.

seer-by-sentry bot and others added 2 commits

October 30, 2025 09:40


          feat(deletions): Retry transient Snuba errors in nodestore deletion tasks

94ed85c


          feat(deletions): Retry deleting from Nodestore for certain Snuba errors

17c4dfd

armenzg self-assigned this

github-actions bot added the Scope: Backend label

vercel bot deployed to Preview

October 30, 2025 13:57

cursor bot reviewed

tests/sentry/deletions/tasks/test_nodestore.py

    
                              "deletions.nodestore.retry",

                              tags={"type": f"snuba-{type(snuba_error).__name__}"},

                              sample_rate=1,

                          )

Contributor

cursor bot Oct 30, 2025

Bug: Test Fails Due to Unexpected Error Handling

The test_snuba_errors_retry test expects UnqualifiedQueryError("All project_ids from the filter no longer exist") to trigger a retry. However, the code specifically handles this error by logging an info metric and completing normally, which causes the test to fail and contradicts the behavior in test_deletion_with_all_projects_deleted.
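The conflict described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `UnqualifiedQueryError` and `RetryTask` are local stand-ins for the real classes, `run_deletion` and `projects_gone` are hypothetical names, and treating every other message as retryable is an assumption made only for the example.

```python
class UnqualifiedQueryError(Exception):
    """Stand-in for the Snuba error named in the review comment."""


class RetryTask(Exception):
    """Stand-in for the task framework's retry signal."""


ALL_PROJECTS_GONE = "All project_ids from the filter no longer exist"


def run_deletion(fetch_event_ids):
    """Hypothetical handler: the message-specific branch wins over the retry path."""
    try:
        fetch_event_ids()
    except UnqualifiedQueryError as error:
        if error.args and error.args[0] == ALL_PROJECTS_GONE:
            # Permanent condition: record it and finish without retrying.
            return "all-projects-deleted"
        # Assumption for this sketch: any other message is treated as retryable.
        raise RetryTask(f"Snuba error: {type(error).__name__}") from error
    return "deleted"


def projects_gone():
    raise UnqualifiedQueryError(ALL_PROJECTS_GONE)


# The handler completes normally, so a test wrapping this call in
# pytest.raises(RetryError) would report "DID NOT RAISE".
outcome = run_deletion(projects_gone)
print(outcome)
```

If the handler is shaped like this, a test that wants to exercise the retry path would need an error that does not hit the message-specific branch.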

codecov bot commented Oct 30, 2025

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
41439	2	41437	254

View the top 2 failed test(s) by shortest run time

tests.sentry.deletions.tasks.test_nodestore.NodestoreDeletionTaskTest::test_unqualified_query_error

Stack Traces | 3.8s run time

.../deletions/tasks/test_nodestore.py:123: in test_unqualified_query_error
    with pytest.raises(DeleteAborted):
E   Failed: DID NOT RAISE <class 'sentry.exceptions.DeleteAborted'>

tests.sentry.deletions.tasks.test_nodestore.NodestoreDeletionTaskTest::test_snuba_errors_retry

Stack Traces | 3.88s run time

.../deletions/tasks/test_nodestore.py:159: in test_snuba_errors_retry
    with pytest.raises(RetryError):
E   Failed: DID NOT RAISE <class 'sentry.taskworker.retry.RetryError'>

armenzg commented

src/sentry/deletions/tasks/nodestore.py

    
                  # TODO: Add specific error handling for retryable errors and raise RetryTask when appropriate

                  except Exception:

                      metrics.incr(f"{prefix}.error", tags={"type": "unhandled-exception"}, sample_rate=1)

Member Author

armenzg Oct 30, 2025

This metric let me see that something changed in the last couple of days, and it led me to this Sentry issue.

src/sentry/deletions/tasks/nodestore.py

    
                          # This is not a transient error - retrying won't help since the project is permanently gone.

                          logger.info("All project_ids from the filter no longer exist")

                          # There may be no value to track this metric, but it's better to be safe than sorry.

                          metrics.incr(f"{prefix}.info", tags={"type": "all-projects-deleted"}, sample_rate=1)

Member Author

armenzg Oct 30, 2025

Reducing the metric from warning to info.

src/sentry/deletions/tasks/nodestore.py

    
                          metrics.incr(f"{prefix}.warning", tags={"type": "all-projects-deleted"}, sample_rate=1)

                          # When deleting groups, if the project gets deleted concurrently (e.g., by another deletion task),

                          # Snuba raises UnqualifiedQueryError with the message "All project_ids from the filter no longer exist".

                          # This happens because the task tries to fetch event IDs from Snuba for a project that no longer exists.

Member Author

armenzg Oct 30, 2025

We have to remember that deleting from the Nodestore is best effort, since the events will eventually expire via their TTL.

Alternatively, we could have the nodestore task delete the project, rather than doing that in the spawning task.

src/sentry/deletions/tasks/nodestore.py

    
                          metrics.incr(f"{prefix}.error", tags={"type": "unqualified-query-error"}, sample_rate=1)

                          # Report to Sentry to investigate

                          raise DeleteAborted(f"{error.args[0]}. We won't retry this task.") from error

                          metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1)

Member Author

armenzg Oct 30, 2025

Using type(error).__name__ instead of unqualified-query-error to distinguish the errors in Datadog.
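A small sketch of why the class name is a more useful tag than a fixed string. The exception classes, the `incr` helper, and the in-memory counter are hypothetical stand-ins for the real Snuba errors and the metrics backend. Only the `deletions.nodestore` prefix comes from the diff.

```python
from collections import Counter


class SnubaError(Exception):
    """Hypothetical base class for this sketch."""


class UnqualifiedQueryError(SnubaError):
    pass


class RateLimitExceeded(SnubaError):
    pass


# In-memory stand-in for a metrics backend, keyed by (metric name, type tag).
counts = Counter()


def incr(key, tags):
    counts[(key, tags["type"])] += 1


errors = [
    UnqualifiedQueryError("a"),
    RateLimitExceeded("b"),
    RateLimitExceeded("c"),
]
for error in errors:
    # One tag value per exception class instead of one shared label.
    incr("deletions.nodestore.error", {"type": type(error).__name__})

for (key, type_tag), value in sorted(counts.items()):
    print(key, type_tag, value)
```

With a fixed tag all three increments would land in a single series. With the class name they split into one series per error type.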

src/sentry/deletions/tasks/nodestore.py

    
                      logger.warning(

                          f"{prefix}.retry", extra={**extra, "error_type": error_type, "error": str(error)}

                      )

                      raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error

Member Author

armenzg Oct 30, 2025

This is the new feature in this PR. I believe the errors above can be retried safely, since there is an upper bound on how many times the task is retried.
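A rough sketch of the bounded retry idea, under stated assumptions: `TransientSnubaError`, `delete_once`, `run_with_retries`, and `MAX_ATTEMPTS` are hypothetical names, and the loop only imitates what a task framework would do when a task raises `RetryTask`. It is not the taskworker implementation.

```python
import logging

logger = logging.getLogger("nodestore-sketch")


class RetryTask(Exception):
    """Stand-in for the task framework's retry signal."""


class TransientSnubaError(Exception):
    """Hypothetical stand-in for the Snuba errors the PR treats as retryable."""


MAX_ATTEMPTS = 3  # assumed bound, chosen only for this example


def delete_once(fetch):
    """Run one attempt and convert a transient error into a retry request."""
    try:
        return fetch()
    except TransientSnubaError as error:
        error_type = type(error).__name__
        # Log a warning rather than reporting an error: the retry may succeed.
        logger.warning("deletions.nodestore.retry", extra={"error_type": error_type})
        raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error


def run_with_retries(fetch, max_attempts=MAX_ATTEMPTS):
    """Imitate a task runner that gives up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return delete_once(fetch), attempt
        except RetryTask:
            if attempt == max_attempts:
                raise


calls = {"n": 0}


def flaky_fetch():
    # Fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientSnubaError("temporary failure")
    return ["event-1"]


result, attempts = run_with_retries(flaky_fetch)
print(result, attempts)
```

Once the bound is exhausted the last `RetryTask` propagates, so a permanently failing call still surfaces as a failure.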

src/sentry/deletions/tasks/nodestore.py

    
                  ) as error:

                      error_type = type(error).__name__

                      metrics.incr(f"{prefix}.retry", tags={"type": f"snuba-{error_type}"}, sample_rate=1)

                      logger.warning(

Member Author

armenzg Oct 30, 2025

We don't trigger a Sentry event every time this happens since the retried task may succeed.

src/sentry/deletions/tasks/nodestore.py

    
                      raise RetryTask(f"Snuba error: {error_type}. We will retry this task.") from error

                  except Exception as error:

                      metrics.incr(f"{prefix}.error", tags={"type": type(error).__name__}, sample_rate=1)

Member Author

armenzg Oct 30, 2025

Using type(error).__name__ instead of unhandled-exception to distinguish in Datadog.
